Integration of single-cell multi-omics data by regression analysis on unpaired observations

Despite recent developments, it is hard to profile all multi-omics single-cell data modalities on the same cell. Thus, huge amounts of single-cell genomics data of unpaired observations on different cells are generated. We propose a method named UnpairReg for the regression analysis on unpaired observations to integrate single-cell multi-omics data. On real and simulated data, UnpairReg provides an accurate estimation of cell gene expression where only chromatin accessibility data is available. The cis-regulatory network inferred from UnpairReg is highly consistent with eQTL mapping. UnpairReg improves cell type identification accuracy by joint analysis of single-cell gene expression and chromatin accessibility data. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02726-7.


Simulation pipeline
We simulate the scATAC-seq data according to the method proposed in ref [35] taking bulk ATAC-seq count matrix as input. We take Lun [37] as reference to simulate scRNA-seq data.
The detail is described as follows.
We generate the bulk ATAC-seq count matrix by a discrete distribution which is calculated from the ENCODE ATAC-seq dataset [36]. We gain a 5,000 by 10 count matrix , in which 10 represents 10 cell types (samples) and 5000 is the number of peaks.
(2) Generate the real single cell ATAC-seq data.
We assign the number of cells included in each cell type first and get M cells across all cell types in total. We permute 500 peaks for each cell from the bulk count matrix, generating a 5000 by M single cell count matrix . We generate real single cell ATAC-seq ∈ [0,1] of a single cell for the cell type in peak as Where denotes the probability of single cell being accessible in peak , N denotes the number of reads in peaks for each cell.
(3) Generate the observed single cell ATAC-seq data.
We define the observed accessibility ∈ {0,1,2} as Where defines the probability that a count will be observed in peak for a single cell, q denotes the noise parameter (0.22 for our study [35]).
Here, we define as a dropout if = 0 and > 0. The dropout rate is defined as the number of dropouts divided by the number of nonzero values in matrix.
We simulate cis-regulatory coefficient by sparse normal distribution.
(5) Simulate single cell RNA-seq real data. We define the real single cell RNA-seq data = � � × of the gene k in cell j as = (6) Generate the observed single cell RNA-seq data.
To generate a same dropout rate as ATAC-seq data, we random set the same proportion of C as zero. We denote the new matrix as ̃.
We define the observed gene expression as where is a dispersion parameter. We set this dispersion parameter as 0.1 according to ref (7) Generate the unpaired data.
The unpaired data is generated by random selecting half cells from the scRNA-seq data and separating other cells from scATAC-seq data.

Simulation result
According to this procedure, we assign 5 proportions of cell type and generate 5 simulation datasets. For each dataset, we calculate the drop-out rate of ATAC-seq data by cell depth parameter N. The cell type size is listed in the Table S1. In Dataset 1, we assign an equal proportion of cell types. The number of cell types for Dataset 2 follows a negative binomial distribution whose mean is 1000. We set 1-3 minor populations which contain 100 cells for Dataset 3-5.
Taking the unpaired data as input, we first estimate cis-regulatory coefficients by UnpairReg.
Since we know the real cis-regulatory coefficient in this simulation data, we compare the estimated coefficient with the ground truth by calculating Pearson Correlation Coefficients (PCC) to evaluate the coefficient estimation. Figure 2A shows the result of simulation dataset 1 and To evaluate the performance of gene expression prediction, we take the real gene expression data as ground truth and calculate the PCC between our prediction and ground truth. We compare the results with the observed gene expression data, which is affected by dropout. We also compared the whole distributions of PCCs from UnpairReg with that from observed gene expression data as well as a gene expression prediction based on a random cis-regulatory coefficient at the dropout rate of 0.87 ( Figure 2D-E and Fig. N1 B and C).
We observe a remarkable difference between our prediction and the other two predictions of cell level PCC ( Figure 2D and Fig. N1B) and gene level PCC ( Figure 2E and Fig. N1C

Algorithm performance
Based on simulation data (Dataset 1), we evaluate the UnpairReg algorithm by convergence, initial value performance, sensitivity to the initial value, running time, and memory usage.
(1) Algorithm convergence: We iterate 100 steps for a UnpairReg initial cis-regulatory coefficient and a random one, respectively, gaining the cost function after each iteration to test the algorithm's convergence.
As is shown in Fig. N2A, the cost function monotonously decreases along with the iteration.
(2) Performance of the initial cis-regulatory coefficient: Fig. N2A shows a huge gap in the cost function in the first iteration between the UnpairReg (9909.2) and random (1839.5) initial cis-regulatory coefficient, which suggests UnpairReg gives a much better initial value than a random one.
(3) Sensitivity of the initial cis-regulatory coefficient: The cost function of the random case decreases sharply in the first five iterations (from 9909.2 to 1976.5). After the 100 th iteration, the cost function of random and UnpairReg initial cases are 1537.3 and 1515.2, respectively, which is close to each other. So UnpairReg is not sensitive to the given initial cis-regulatory coefficient.
(4) Running time Fixing the number of peaks as 5,000, we test the computational time (running time) under different numbers of genes from 5,000 to 10,000. Fig. N2B shows that running time is linear increasing along with the number of genes. Next, we fix the number of genes as 1,000 and test the running time for different numbers of peaks from 5,000 to 10,000. Running time is linear increasing along with the number of genes (Fig. N2C).
We compare the running time of our algorithm and the quasi-Newton algorithm. We assign 500 peaks, 100 genes, and 10,000 cells, generating simulation data. We perform the UnpairReg and quasi-Newton algorithms, which take 0.25s and 42.94s, respectively, to converge under the same terminate condition.